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ABS1BACT 

Computation  bound  problems  impose  a  severe  burden  on 
the  CEU.  In  order  to  speed  up  computation,  specific  problems 
that  are  identified  as  the  main  burden  can  be  done  using 
parallel  processing.  In  this  way,  the  time  consuming  tasks 
can  be  executed  on  specially  tailored  hardware.  This  hard- 
ware is  designed  to  implement  an  algorithm-oriented 
parallel-processing  structure  that  works  more  efficiently 
than  the  CPU  for  these  specific  tasks.  This  thesis  is  a 
study  of  the  mapping  cf  the  algorithms  onto  a  kind  cf  struc- 
ture called  systolic  array.  The  development  and  utilization 
of  a  software  tool  designed  to  assist  on  such  analysis  is 
presented  here.  This  tool,  named  Systolic  Array  Graphics 
Simulator  (SYSGRAS) ,  has  the  capability  to  represent  any 
type  cf  systolic  array,  no  matter  how  complex  the  cells  and 
structure  are.  Because  of  the  capability  of  SYSGF.AS,  an 
interactive  computer  program  simulator,  the  study  of 
systolic  arrays  is  simplified.  The  complexity  of  the  time- 
space  relationships  is  analyzed  with  the  help  of  seme  color 
graphics  techniques.  The  visualization  of  the  data  interac- 
tion is  thus  enhanced  and  the  user  is  alleviated  from  the 
burden  of  keeping  track  of  partial  results  and  can  dedicate 
attention  to  the  processing  algorithm. 
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I.  THE  EEFINITIOB  OF  THE  PROBLEM 

A.  A  FIEST  IHSIGHT 

"A  systolic  system  is  a  network  of  processors  which  rhyth- 
mically compute  and  pass  data  through  the  system. 
Physiologists  use  the  word  "systole"  to  refer  to  the 
rhythmically  recurrent  contraction  of  the  heart  and 
arteries  which  pulses  blood  through  the  body.  In  a 
systolic  computing  system,  the  function  of  a  processor  is 
analogous  to  that  of  the  heart.  Every  processor  regularly 
pumps  data  in  and  out,  each  time  performing  some  short 
computation,  so  that  a  regular  flow  of  data  is  kept  in  the 
network. " 

"...matrix  computations  can  be  pipelined  elegantly  and 
efficiently  on  systolic  networks  having  an  array 
structure." 

"...special  purpose  hardware  devices  based  on  systolic 
arrays  can  te  built  inexpensively  using  the  VLSI 
technology." 

These  are  parts  of  the  abstract  of  the  paper  where 
Professors  H.T.  Kung  and  C.E.  Leiserson  have  first 
presented  the  concept  of  systolic  arrays  [Bef.  1].  They 
fully  describe  the  concept  and  context  to  which  the  term 
"systolic  array"  applies. 

Several  authors  have  presented  papers  on  systolic  arrays 
in  recent  years,  but  this  subject  is  very  new  and  everything 
dates  back  to  1978. 

B.  SISTCLIC  ABBAYS:  1BE  BASIC  PRINCIPLE 

A  systolic  system  consists  of  a  set  of  interconnected 
cells,   each   capable  of   performing  some   simple  operation. 


Information  flows  between  cells  in  a  pipelined  fashion,  and 
communication  with  the  outside  world  occurs  only  at 
"boundary  cells". 

The  usual  computational  tasks  are  basically  divided  into 
two  categories — computing  bound  and  input/output  (I/O) 
bound-  If  the  total  number  of  computing  operations  is  much 
larger  than  that  of  I/O  operations,  then  we  have  a  computing 
bound  task.  Systolic  arrays  are  useful  to  speed  up  this 
type  cf  task.  The  array  is  able  to  use  each  input  element  a 
number  cf  times  thus  achieving  a  high  computational 
throughput  without  increasing  memory  bandwidth.   [Bef.  2] 

C.   DIVISEE  AECHITECTEBES 

A  number  of  architectures  have  already  been  proposed  for 
solving  the  traditional  problems.  These  are  linear  arrays 
for  performing  matrix-vector  multiplication  and  finding 
solution  of  triangular  linear  systems;  two-dimensional 
arrays  for  matrix-matrix  multiplication/addition,  least 
squares  solution  and  ID  factorization, etc  [Eef.  3]. 

E.   HAB*£BE  EXPLOBATOBY  DEVELOPMENT 

Because  this  subject  is  new,  much  of  the  development  is 
still  in  an  exploratory  stage.  In  order  to  further  evaluate 
the  hardware  potential  of  systolic  arrays,  TEK/ESI  has 
developed  a  processor  called  Fhoenix  that  has  already  been 
used  to  provide  two-dimensional  FIR  filters  for  image 
processing  [Eef-  3:  page  3]-  To  evaluate  systolic  algo- 
rithms and  architectures,  the  Naval  Ocean  System  Center  has 
developed  a  reconf igurable  one-two  dimensional  systolic 
array  with  64-processing  elements  £Bef.  4  ]•  This  systolic 
testbed  processor  can  be  reconfigured  under  software  control 
to  perform  as  a  rectangular,  hexagonal,  or  linear  systolic 
array.    The    main    goal   of   its   design   was   to  achieve 
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flexibility  in  algorithm  evaluation,  and  its  cells  are  much 
slower  than  those  used  in  the  TRW/ESL  Phoenix  processor  or  a 
possible  implementation  in  a  VISI  chip. 

X.   HHI  HOT  TO  EXPLORE  IT  WITH  SOFTWARE? 

During  the  preliminary  investigations  in  this  field,  the 
approach  was  to  select  a  case  study  and  try  to  go  deeper  in 
the  process  of  understanding  the  way  a  particular  algorithm 
is  mapped  onto  a  systolic  architecture.  It  was  discovered 
that  it  is  not  always  straightforward  to  understand  the 
mapping  process.  An  algorithm  like  the  one  proposed  by  H.T. 
Fung  and  W.M.  Gentleman  for  doing  matrix  triangularization 
can  he  really  difficult  to  understand  if  one  dees  not 
possess  an  adequate  tcol  [Sef.  7:  page  22].  The  algebraic 
manipulation  can  be  difficult  when  dealing  with  systolic 
arrays  because  the  interaction  of  the  processors  tend  to 
lead  to  a  very  complicated  set  of  equations  in  only  a  few 
steps  and  it  becomes  difficult  to  achieve  any  generaliza- 
tion. Seme  notations  have  been  proposed  to  help  in  the 
modeling  the  parallel  processing  £Ref.  5 ]#  but  no  systematic 
method  for  mapping  an  algorithm  onto  an  array  exists.  There 
is  a  lack  of  the  appropriate  tools  for  systolic  algorithm 
development.  However,  with  the  development  of  graphics 
technclcgy  a  means  now  exists  that  can  be  used  to  help 
people  tetter  understand  how  a  systolic  array  works.  If  we 
conjugate  this  possibility  with  interactive  programming 
technigues,  it  may  be  possible  provide  an  user  friendly  tool 
that  can  be  used  tc  achieve  the  goal  which  is  the  main 
guideline  of  this  investigation:  to  provide  an  easier  way 
to  understand  systolic  array  operation.  A  case  study  will 
be  presented  in  Chapter  III  that  turned  towards  performing 
Matrix  Triangularization  with  the  Givens  Rotations 
Algorithm.    Eventually,   the   systolic   processing   can   be 
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understood  and  checked  against  real  data  with  the  use  of  our 
Systolic  Array  Graphics  Simulator  (SYSGRAS) .  This  approach, 
in  contrast  with  that  followed  by  researchers  at  the 
B.O.S.C.,  in  San  Diego,  is  to  take  full  advantage  of  the 
software,  using  computer  graphics  presentation  to  improve  or 
to  add  another  dimension  to  the  man/machine  interface. 

Anyway  we  want  to  point  out  that  this  work  must  fce  seen 
as  an  investigation  of  the  possibilities  of  using  this  kind 
of  facilities  as  tools  rather  than  as  a  final  product.  Only 
experimentation  in  practical  work  can  show  the  advantages  or 
the  limitations  of  this  approach.  As  in  any  other  area  of 
engineering,  feedback  from  users  is  important  in  the  evalua- 
tion and  future  improvement  of  this  tool. 
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II.  THE  SIMULATOR  APPROACH 

The  Systolic  Array  Graphical  Simulator  (SYSGRAS)  is 
implemented  to  allow  interactive  design  of  any  type  of 
systolic  array.  The  cells  and  their  interconnection  links 
are  defined  by  the  user  in  a  guided  manner.  The  optional 
cell  types  are  built  into  the  SYSGRAS.  The  user  can 
construct  any  array  using  the  existing  cell  types  or  by 
introducing  new  cell  types,  tut  the  introduction  of  a  new 
cell  type  cannot  be  done  interactively.  This  reguires  a 
subroutine  to  describe  and  support  that  type,  but  the  opera- 
tion is  simple  and  will  be  explained  later.  The  presently 
available  types  can  support  most  of  the  algorithms  found  in 
the  literature. 

SYSGRAS  is  designed  with  a  reguirement  of  being  user 
friendly.  It  has  been  realized  that  the  amount  of  data  that 
is  normally  reguired  to  be  entered  by  a  user  in  a  session  is 
guite  large.  It  is  important  to  offer  the  possibility  of 
correcting  errors,  reviewing  entered  data,  and  storing  data 
and  results  from  the  session.  It  can  also  be  used  later  in 
another  session.  The  sequence  of  operations  required  during 
a  session  is  shown  below: 

1)  Establishment  cf  general  conditions: 

new  or  previous  protlem  to  run. 

need   to   store   the   calculations   for    later 

printing. 

ability   tc  interrupt   or  review   the  progress  of 

systolic  processing  on  a  clocked  basis. 

2)  Definition  of  cells'  attributes: 

type  of  processor  in  the  cell. 
graphical  shape  of  the  cell, 
screen  coordinates  cf  the  cell. 
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interfaces  (importer  cell,  receiver  of  external 
data;  internal  cell,  only  exchanges  data  with 
other  cells;  exporter  cell,  data  transmitter 
to  external  world) . 

3)  Definition  of  links: 

define  the  origin  and  destination  of  each  link. 
The  origin  is  a  port  in  a  cell.  Each  cell  has  a 
number  of  output  and  input  ports.  The  function 
of  each  one  depends  on  the  processor  type. 
The  destination  is  a  input  port  in  another 
cell.  A  link  is  defined  by  origin  cell 
output  port  and  the  destination  cell  input 
port. 

4)  Presentation  on  the  graphical  screen: 

-  the  picture  of  the  user  defined  array  is 
initially  presented  on  the  screen  with  all 
processor  registers  initialized  to  zero.  Since 
the  presentation  is  strongly  based  on  colors, 
the  non  existence  of  data  interaction  in  the 
cells  makes  them  all  turn  into  black.  The 
identification  number  given  by  the  user  to 
each  cell  is  placed  at  the  cell's  lower  light 
corner  and  red  arrows  indicate  the  licking 
between  cells  and  the  direction  of  data  flux 
(see  Fig.  4.7  at  Chapter  IV). 

5)  Data  input  and  screen  update: 

as  soon  as  data  originated  from  the  external 
world  to  the  importer  cells  is  entered,  the 
whole  screen  is  updated  and  representing  the 
status  of  the  systolic  array  at  that  partic- 
ular clock  cycle.  The  external  data  are  entered 
at  each  clock  cycle,  and  the  screen  is  updated 
accordingly. 
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6)  Review  of  the  complete  systolic  process: 

optionally,  the  user  can  review  the  complete 
process.  The  tctal  data  generated  in  a 
session  are  stored  automatically,  and  the 
graphics  presentation  of  any  previous  session 
can  be  seen  again  without  any  additional 
data  entry. 

The  central  idea  of  the  SYSGRAS  is  to  allow  a  user  to 
construct  any  desired  systolic  array  and  to  use  as  the  input 
test  data  such  that  insight  can  be  gained  about  the  data 
interactions  of  the  systolic  algorithm.  The  user  has  the 
capability  to  correlate  the  propagation  of  the  input  data 
with  the  colors  of  the  cells  to  better  understand  the 
process.  Certainly,  the  greater  the  creativity  of  assigning 
the  cclcr,  the  better  the  result  showing  the  interactions. 
There  is  no  restriction  as  to  how  to  do  the  correlation  and 
several  examples  will  be  presented.  The  colors  that  are 
available  to  a  input  datum  are  red,  green  and  blue.  If  a 
cell  receives  data  from  more  than  one  source  with  different 
primary  colors,  these  colors  will  combine.  The  cell  will 
show  en  the  screen  with  a  resulting  color  that  is  the  combi- 
nation of  the  input  primary  colors.  This  resulting  color  is 
not  passed  on  to  other  cells.  The  propagation  of  data  is 
associated  with  the  primary  colors,  and  this  way  a  good 
tracking  of  the  timing  that  takes  for  each  data  wavefront  to 
propagate  through  the  array  can  be  achieved.  The  screen 
display  shows  the  array  in  a  way  selected  by  the  user.  The 
cells  are  presented  with  the  shape,  position,  identification 
and  connections  requested  by  the  user.  The  colors  and  numer- 
ical results  assumed  in  each  clock  cycle  show  the  data 
interaction.  The  numerical  results  that  appear  on  the 
screen  is  in  a  fixed  format  and  with  rather  limited  preci- 
sion.  If  one  wants  greater  precision,  there  is  an  option  to 
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record  the  session  which  can  show  results  with  up  to  five 
decimals  on  printer  cutput.  These  results  can  te  checked 
against  the  known  values  to  verify  the  correct  operation  of 
the  mapped  systolic  algorithm. 
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III.  A  CASE  STUDY:  HATRIX  TglANGOIAfilZATION  BY  GIVENS 

ROTATIONS 

A.   A  REASON  FOR  HATRIX  TRIANGULARIZATION:  QR  DECOMPOSITION 

The  technique  of  CR  decomposition  is  useful  to  solve  the 
linear  least  Squares  Problem.  There  are  other  possible 
approaches  in  the  literature  to  solve  this  type  of  problem 
[Ref.  6].  We  will  select  the  QR  Decomposition  Method  as 
mentioned  by  Gentlemac  and  Rung  in  [Ref.  7:  page  19].  let's 
present  a  trief  explanation  of  the  method. 

Given  a  linear  system  defined  by 

AX  =  E 

where  A  is  a  (n  x  p)  matrix,  X  is  a  (p  x  1)  and  B  is  a  (n  x 
1)  column  vector.  5e  want  tc  find  the  vector  X  such  that 
| jr J | =| | E-AX| |  is  minimized.  The  vector  X  is  also  called 
vector  of  regression  coefficients.  This  is  in  fact  a  vector 
of  estimated  elements  and  as  such  it  strictly  should  be 
written  X  but  for  convenience  X  will  continue  to  fce  used. 
Also  a  mere  rigorous  matrix  notation  would  be,  for  example, 
A  instead  of  A.  If  we  are  able  to  find  an  orthogonal  matrix 
Q  such  that  QA=R  where  R  is  an  upper  triangular  matrix,  then 
we  will  have 

QAX  =  BX  =  QB 

and,  as  a  result,  RX  =  Q3. 

As  E,  Q,  and  E  are  all  known,  we  can  find  X  by  Back 
Substitution.  Gentleman  has  shown,  as  seen  in  [Ref.  8:  page 
329],  that  this  approach  solves  the  Least  Squares  Problem 
because  it  solves  the  normal  equations  of  the  problem 
[Ref.  6:  page  148]. 
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An  accurate  QR  Decomposition  can  be  obtained  in  several 
ways,  like  the  Gram- Schmidt  Process  or  Householder 
Transformations.  For  this  particular  study  we  will  stick  to 
the  Givers  Rotations  Method.  This  method  requires  the  use  of 
a  sequence  of  plane  rotations  on  matrices  A  and  B,  that  will 
convert  the  linear  system  to  a  representation  of  the  fcrm 

RX  =  £B 

which  is  straightforward  to  solve. 

The  Givens  Rotations  Method  has  two  different  iiplemen- 
tations  £Ref.  8:  page  331].  We  will  study  the  one  called 
"with  square  roots". 

E.   QB  DECOMPOSITION  EY  GIVENS  ROTATIONS  WITH  SQUARE  ROOTS 

If  Q  is  an  orthogonal  matrix,  Y=QA  is  called  an  orthog- 
onal transformation  and  the  columns  of  Q,  say  q{1)/  g(2)  , 
...  are  orthogonal.  In  order  to  better  understand  what  an 
orthogonal  transformation  means,  we  can  interpret  the  g(ij's 
as  unit  vectors  (because  of  their  unit  length)  that  repre- 
sent the  new  reference  frame  in  the  original  coordinates 
frame.  If  we  apply  ar  orthogonal  transformation  to  a  matrix 
(or  to  a  vector) ,  we  will  be  in  fact  changing  the  reference 
coordinates  for  that  "vector.  The  information  content  of  the 
transformed  matrix  (cr  vector)  will  be  the  same,  although 
different  from  previous  form  because  of  the  new  reference. 
The  columns  of  Q  are  given  by  the  direction  cosines  of  the 
new  reference  axes  with  respect  to  the  old  reference  system 
[Ref.  12:  page  354]. 

An  orthogonal  transformation  can  be  divided  into  a 
sequence  of  elementary  transformations.  In  the  case  of 
Givens  Rotations,  each  one  of  these  elementary  transforma- 
tions is  a  rotation  about  one  coordinate  axis  of  the  orig- 
inal reference  frame  (which  may  be  n-dimensionai) . 
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Id  geometry  we  knew  that  the  transformation 


a 
b 
c 


cos  (x1)  cos(x2)  cos  (x3) 
cos(y1)  cos(y2)  cos  (y3) 
cos(z1)  cos(z2)  cos  (z3) 


'   a1    ' 

• 

b1 

.   c1   . 

applied  to  a  vector  (a1#  hi,  c1]  will  rotate  the  original 
reference  frame  about  the  3  axes  at  the  same  time.  The 
direction  cosines  of  the  new  x-axis  are  cos(x1),  ccs  (x2)  and 
cos  (x2)  ;  the  direction  cosines  of  the  new  y-axis  are 
cos(y1),  cos  (y2)  and  cos  (y3)  ,  and  so  on.  The  transformation 
effect  achieved  by  the  matrix  cf  cosines  is  similar  to  that 
obtained  by  the  Givens  Rotation  transformation  matrix  Q* 

fle  can  split  the  cosines  matrix  into  a  product  of 
matrices  each  one  representing  a  rotation  about  one  axis  of 
the  original  frame.  This  technique  of  decomposing  the  oper- 
ation into  product  operations,  each  one  being  responsible  to 
rotate  the  system  matrix  of  an  angle  TETA  about  one  axis  of 
the  original  frame,  is  exactly  what  is  done  in  Givens 
Algorithm.  Each  individual  Givens'  Rotation  is  represented 
by  a  matrix  of  the  form 


D (j,i)  = 


10  0  0 0  0 

0  10  0 0 

0  0 

0  ..0  c  0 s 0 

0  ...  0  1  0 0 

0  0 

0  . . 0  1  0  ...  0 

0 -s 0  c  0  .  0 

0  1  .  o 

0 o 

0 0  1 


<==  i  th  row 


<==  j  th  row 
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where  sgr(c)  *  sgr  (s)  =  1,  c  =  cos (TETA)  ,  s  =  sin  (TETA)  and 
dots  represent  zeros  [Bef,  6:  page  153].  This  elementary 
rotation  will  convert  the  (j,i)  element  of  the  matrix  which 
it  premultiplies  intc  zero,  that  is,  the  (j/i)  element  of 
the  product  of  matrix  D(j,i)  and  A  will  be 

-s.a(i,i)  +  c.a(j,i)  =  0 

and  sc  it  follows  that 

d  =  sgrtfsgr (a  (i,i))  +  sgr(a{j,i))) 

c  =  a(i,i)/d    and    s  =  a{j,i)/d 

If  the  matrix  A  is  (n  x  n)  ,  it  will  be  necessary  to  have 
(n-1)  elementary  rotations  to  turn  all  elements  of  the  first 
column  (  with  exception  of  a  (1,1)  )  onto  zero,  (n  -  2)  rota- 
tions for   the  second  column,   and  so   on.   For   a  (3   x  3) 
matrix,  three  rotations  will  suffice. 

C.   MAPPING  GIVEHS  BCTATIONS  AIGOBITHM  INTO  A  SYSTOLIC  ARRAY 

The  problem  of  napping  an  algorithm  into  a  systolic 
structure  frequently  turns  out  to  be  the  subject  of  certain 
restrictions  because  the  computed  results  may  result  in  very 
small  numbers  impossible  to  be  represented  if  fixed  point 
hardware  is  used.  In  order  to  avoid  establishing  undesirable 
conditions  on  the  data,  one  must  try  to  select  an  algorithm 
possible  to  be  mapped  and  to  perform  computations  en  any 
data  within  the  selected  range  of  approximations. 

"He  have  already  presented  the  Givens  Algorithm.  The 
problem  now  is  to  understand  how  a  systolic  array  can 
realize  it.  The  reader  must  be  aware  that  the  understanding 
of  the  mathematical  algorithm  does  not  lead  to  immediate 
insight  en  how  the  array  works.  For  the  original  discussion 
on  this  algorithm  the  reader  is  referred  to  [Ref.  7]. 
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Presented  in  the  following  are  the  specifications  of  the 
cell  processors  used  in  this  systolic  array,  the  general 
cell  arrangement,  and  the  input  data  streams.  However,  the 
sight  of  such  specification  does  not  reveal  the  intricacies 
of  the  deduction  of  which  function  should  be  performed  by 
each  particular  cell  element.  A  possible  way  to  gain 
insight  as  to  how  the  algorithm  is  mapped  onto  the  array  is 
to  perform  an  algebraic  validation  that  is  extremelly  labo- 
rious. For  a  start,  suppose  that  we  want  to  triangularize  a 
(3  x  3)  matrix  A.  We  must  prenultiply  it  3  times  by  elemen- 
tary rotation  matrices  which  will  turn  the  elements  a  (2, 1)  , 
a  (3,1)  and  a  (3,2)  intc  zeros.  The  first  transformation  will 
be 


c   s  0 
-s   c  0 

0   0   1 


a(1,1)   a(1,2)   a(1,3) 

a(2,1)   a(2,2)   a  (2,  3) 

L  a(3,1)   a(3,2)   a  (3,3)  J 


This  elementary  rotation  will  turn  element  (2,1)  into  null 
on  the  product  matrix.  We  can  note  that  the  element  (-S)  at 
the  transformation  matrix  occupies  the  position  of  the 
element  that  will  be  turned  into  zero  in  the  transformed 
matrix.  The  transformed  matrix  will  become 


a'  (1,1) 

0 
a(3,1) 


(1,2) 
(2,2) 


(1,3) 
(2,3) 


a(3,2)    a(3,3) 

which  will  be  operated  by  a  second  transformation  matrix 


c   0   s 

0   1   0 

L-s   0  c 


aM1,1) 

0 
I  a(3,1) 


a»(1,2) 

a'(2,2) 
a (3,2) 


a'  (1,3) 
a»  (2,3) 
a(3,3) 


and  that  will  generate  the  matrix 


2a 


a»(1,1)      a"(1,2)      a"(1,3) 
0  a»  (2,2)      a1  (2,3) 

0  a1  (3,2)       a»  (3,3) 


Finally    the   last   transformation 


1      0      0 

0      c      s 
0    -s      c 


a"  (1,1)       a»(1,2)       a"  (1,3) 
0  a'  (2,2)       a'  (2,3) 

0  a»  (3,2)       a'  (3,3) 


will  generate  the  upper  triangular  matrix 


a"  (1,1)  a"  (1,2)  a»(1,3) 
0  a"  (2,2)  a"  (2,3) 
0        0      a"  (3,3) 


It  is  important  to  note  that  the  s's  and  c's  are 
different  in  each  transformation  matrix  and  are  calculated 
the  way  we  have  already  pointed  out,  that  is,  for  the  first 
elementary  transformation, 

d  =  sgrt  (sqr(a(1,1))  +  sgr(a(2,1))) 

c  =  a(1,1)/d    and   s  =  a(2,1)/d 

Now  the  evaluation  of  the  elements  of  each  transformed 
matrix  should  be  considered  by  performing  the  actual  multi- 
plications to  correlate  the  resulting  algebra  with  the  algo- 
rithmic processor's  equations  presented  in  Figs.  3.1  and 
3.2.  For  this  particular  algorithm  it  requires  definitely  a 
lot  of  paperwork.  This  cannot  be  accepted  as  a  good  method 
if  one   wants  to   understand  the   general  data   flow  in   the 
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systolic  array,  and  to  check  its  numeric  data.  This  is  why 
we  developed  the  SYSGRAS  to  have  the  possibility  of  doing  an 
evaluation  study  on  how  several  algorithms  found  in  the 
literature  do  perform. 

D.   A  NOHEKICAL  EXAHEIE 

Id  order  to  test  the  Givens  Algorithm  in  doing  Matrix 
Triangularization,  we  have  prepared  a  numeric  study  case.  A 
system  is  represented  by  the  matrix  equation 


2 

4 

1    ' 

"  x(1)     ■ 

'    12   " 

5 

7 

4 

. 

x(2) 

= 

3 

3 

0 

1 

L  x(3)    . 

8 

In  Appendix  C  we  shew  how  to  operate  on  this  equation  for 
triangulating  and  to  solve  it  by  Back  Substitution.  Using 
these  numerical  data,  we  will  set  up  the  problem  fcr  the 
simulator  and  be  able  to  check  out  the  simulation  results 
against  our  hand  calculated  values. 

E.   SETTING  UP  THE  PROBLEM  FOE  SIMULATION 

1  •   Sketching  the  Graphics 

The  first  step  in  preparing  the  problem  for  simula- 
tion consists  of  planning  the  graphics.  We  have  tc  consider 
the  following  points: 

How   do   we  want   our  systolic   array  arranged   on  the 

screen? 

What   kind   of  shape   will  be  used   for  the   cells  (in 

SYSGEAS  we  have  two  options:  square  and  octogcnal)? 

The  screen  coordinates  for   the  cells  (SYSGEAS  divides 

the  screen  into  a  chessboard,   and  therefore  we  have  a 

span  of  8x8  positions  to  place  the  array  cells) . 
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Which  are  the  input  and  output  ports?  This  is  depen- 
dent on  the  subroutine  that  implements  the  cell 
processor,  and  so  must  te  compatible  with  the  defini- 
tion of  the  processor  cell  subroutine. 

Which  links  are  used  to  connect  the  different  ports  of 
the  different  cells?  We  need  to  identify  everything  so 
that  the  correct  information  at  SYSGRAS  can  be 
entered. 

This  may  be  better  understood   with  an  example,   and 

we  will  continue  the  development   of  the  systolic   array  to 

perform  Matrix  Triangularization. 

For  this  particular  case,  we  will  use  three  types  of 

processors:    a  Givens  External  cell,  a  Givens  Internal  cell 

(both  with  square  roots) ,  and  a  Buffer  cell.  The  last  one  is 

used  to  help  the  visualization  of  the  input  data  array  being 

pumped  into  the  systclic  array. 

Figures  3.1,    3.2,   and   3.3  show   which  ports   are 

considered  the  active  terminals  in  the  actual  implementation 
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If  t   3   o   tnen 
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3  V~ 
C  =  r*  /  temp 
5  *  i  /  temp 
r  =  temp 


end 


Legend: 
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Figure   3.1        Givens    External   Processor  Cell 
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Figure  3.2    Givens  Internal  Processor  Cell 
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Figure  3.3    Buffer  Cell 
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in  SYSGRAS  for  these  cell  processors.  Fig-  3.4  show  how  to 
arrange  the  cells  and  their  interconnections.  It  is  impor- 
tant to  realize  to  the  fact  that  although  circular  shaped 
cells  are  presented  in  Figs.  3.1  and  3.4,  they  are  imple- 
mented in  the  SYSGRAS  as  octogonal  shaped  cells.  This 
approach  was  adopted  to  improve  the  computational  speed  of 
the  simulator.  However,  it  has  been  decided  to  keep  the 
standard  circular  shape  in  Figs.  3.1  and  3.4  .  An  addi- 
tional detail  for  the  reader  is  the  fact  that  cells  10  to  13 
in  Fig.  3.4,  buffer  cells,  do  not  show  exactly  the  same  way 
as  Fig.  3.3  does.  This  is  because  a  buffer  cell  is  imple- 
mented in  SYSGRAS  with  three  channels,  but  for  Matrix 
Triangularization  only  one  channel  is  used  (and  so  only  one 
output  port  is  shown  in  Fig.  3.4).  Attention  should  also 
be  paid  to  the  fact  that  the  arrows  shown  in  Fig.  3.4  repre- 
sent the  links  that  have  actually  been  used  in  our  simula- 
tion. Ceils  number  1,2  and  4  of  that  Figure  are  of  the  same 
type  of  the  other  sguare  shaped  cells  that  appear  in  the 
same  Figure,  although  there  is  no  arrow  drawn  on  their  right 
side.  The  reason  is  that  a  third  order  system  is  used  in  the 
simulation  and  so  there  is  no  need  to  extend  the  limits  of 
the  array  beyond  those  cells. 

This  type  of  preparation  that  has  been  presented 
here  is  very  helpful  when  entering  the  data  because  there  is 
a  large  amount  of  information  asked  by  SYSGRAS  to  fce  entered 
interactively.  If  every  detail  is  planned  beforehand,  the 
interaction  between  the  user  and  the  simulator  becomes 
faster  and  easier. 

2 .   Planning  the  Input  Data  Array 

Cnce  the  systolic  array  has  been  planned  and 
sketched,  the  user  must  prepare  the  test  data  that  exercise 
the  simulator.  This  is  pehaps  the  most  important  step  in 
the  simulation.   At  this  point  the  only  rule   to   follow  is 
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Input  Data  Array  Arrangement 
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that  a  time  shift  as  required  by  the  algorithm  has  to  be 
respected.  The  way  cclor  information  will  be  assigned  to  the 
entering  data  is  completely  up  to  the  user,  and  the  greater 
his/her  creativity,  the  better  the  simulation  effect,  and 
consequently,  the  easier  to  interpret  the  results. 

Fig.  3.5  shows  a  possible  way  to  prepare  the  data. 
"We  can  identify  in  that  figure  the  numbers  from  our  numeric 
example  referred  in  last  subsection.  The  time  shift  that  can 
he  seen  is  necessary  to  provide  the  necessary  synchronism 
for  the  systolic  processing.  In  some  algorithms,  the  output 
data  to  the  external  world  must  also  be  collected  in  a 
synchronous  fashion.  But,  for  the  particular  case  of  Matrix 
Triangularization,  as  soon  as  the  final  result  is  achieved, 
each  datum  is  frozen  in  its  cell  and  waits  for  a  pumping  out 
operatioc  whose  connection  links  are  not  being  considered  in 
our  simulation. 
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IV.  THE  ANALYSIS  OF  THE  SIMULATIONS 

A.   EIVIEED  DIFFERENCES 

Before  going  into  detailed  analysis  of  the  Matrix 
Triangularization  sone  simple  algorithms  are  presented. 

In  numerical  interpolation,  we  need  to  compute  the 
divided  differences  from  a  set  of  points  and  then  fcrm  a 
polynomial  from  these  divided  differences.  The  problem  of 
mapping  the  algorithm  for  calculating  these  differences  into 
a  systolic  array  structure  has  been  recently  addressed  in 
[Ref.  10].  In  that  paper,  Li  and  Smith  propose  a  systolic 
architecture  consisting  of  triangular  ceils.  Some  upward, 
some  downward,  according  to  that  paper,  have  the  same 
internal  architecture  but  must  perform  differently  depending 
on  their  orientaticn  with  respect  to  the  data  flow. 
Certainly,  this  poses  the  problem  of  sensing  the  direction 
of  the  data  flow  and  implementing  some  additional  logic  in 
the  cells  to  change  their  function  accordingly.  Li  and  Smith 
also  discuss  sone  implementation  problems  in  their  paper, 
and  pcint  out  that  their  actual  cell  "is  not  exactly  the 
same  as  described"  [ Bef .  10:  page  542]. 

Their  algorithm  is  implemented  in  SYSGEAS,  but  the  basic 
cell  has  keen  modified;  grouping  one  upward  triangle  and  cne 
downward  triangle  into  the  same  cell.  It  is  noticed  that  the 
upward  cell  was  in  fact  actuating  just  as  a  traf fic-orientor 
buffer  and  no  additional  information  was  being  incorporated 
into  the  data  flow.  It  wouldn't  be  cost  effective  to  have 
that  kind  of  processor  since  the  introduced  complexity  was 
due  to  the  need  to  establish  a  homogeneous  type  of  architec- 
ture based  on  triangles.  Therefore,  our  basic  cell  became 
that  shown   in   Fig.  1.1,   and   the   systolic  architecture. 
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Figure  4,1    Divided  Differences  Cell  Processor 

almost  the  same  as  that  proposed  by  Li  and  Smith,  appears  in 
lig.  H.2. 

tfe  have  investigated  the  problem   of  finding  the  divided 

differences  of  the   set  of   points  (1.0,1.0),    (1.8,2.2), 

(3.0,3.3),    (4.3,3.5)   and   (5.3,4.1)    in   the  x-y   plane. 

According  to  [ Ref .  10:   page  539],    the  first  level  divided 

differences  are 

y'(1)  =  (y(2)-y(1))/(x(2)-x(1))  = 

=    (2.2-1. 0)/(1. 8-1.0)    =    1.5000 
J1  (2)     =    (y(3)-y(2))/(x(3)-x(2))     =0.91666 
y'(3)    =    (y  (4)-y  (3))/(x(4)-z(3))    =    0.15385 
y»  (4)    =    (y  (5)-y  (4)  )  /  (X  (5) -x  (4) )    =    0.6000 
Ccntinuing      with        the      calculations, the        second      level 
divided    differences    are 

-0.29167 
-0.30512 
0.19398 
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Figure  4. 2   Divided  Differences  Array  after  Clock  Cycle  1 
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Figure  fl.3   Divided  Differences  Array  after  Clock  Cycle  2 
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Figure    1.U        Divided  Differences   Array   after  Clock   Cycle  3 
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Figure  4.5    Divided  Differences  Array  after  Clock  Cycle  4 
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and    the    third    level    divided    differences    are 

-0.00408 
0.14260 

and  the  fourth  level  is  0.03411.  This  suggests  a  kind  of 
pyramid  structure  that  can  be  seen  from  Figs.  4.2  up  to 
4.5.  The  calculation  of  the  differences  is  performed  in 
four-clock  cycles,  cne  represented  in  each  figure.  The 
bottom  row  displaying  the  first  order  differences,  the  upper 
row,  the  second  order  and  so  on  till  the  top  cell.  We  have 
related  each  point  (P1-P5)  to  a  primary  color,  that  is,  P1 
is  blue,  P2  is  red,  P3  is  green,  P4  is  blue,  and  P5  is  red. 
In  Pig.  4.2  we  can  see  the  result  of  the  first  step  of 
calculations,  with  the  mixing  of  these  colors  at  the  bottom 
cells.  Going  through  each  step  and  comparing  them  with  the 
mathematical  formulation,  we  get  a  better  understanding  of 
the  data  interaction.  The  the  final  results  in  fig.  4.5 
should  be  compared  with  the  above  calculated  values. 

B.   MATRIX-MATRIX  HUITIPLICATICN 

This  is  another  simulation  problem  that  have  been  tried 
on  SYSGRAS.  The  presentation  of  the  algorithm  has  been  dene 
at  [Ref.  Is  page  9].  Presented  in  Fig.  4.7  is  the  arrange- 
ment cf  the  systolic  array  and  how  the  data  is  pumped  into 
it.  The  elements  of  natrix  A  can  be  seen  flowing  towards  the 
lower  right  and  those  cf  matrix  3  flowing  towards  the  lower 
left.  The  resultant  matrix  C  is  pumped  upwards.  We  have 
performed  a  simulation  to  compute  the  matrix  product  AB=C  as 
below 
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The  scheme  used  for  systolic  computation  has  an  advan- 
tage due  to  the  fact  that  a  great  number  of  matrices  that 
are  used  in  typical  problems  are  band  matrices  [Bef .  9], 
Consequently,  the  systolic  array  does  not  need  to  include  as 
many  cells  as  it  would  have  to  for  the  case  of  full 
matrices.  In  this  numeric  example,  in  order  to  restrict  cur 
array  to  a  small  size  so  that  it  can  be  seen  on  the  screen 
(as  in  Fig.  4.8),  we  decided  to  set  elements  a  (1,3)  and 
b(3,1)  tc  zero.  This  way,  a  systolic  array  of  small  dimen- 
sions can  multiply  larger  matrices  with  restricted  nonzero 
band. 

A  single  type  of  cell  is  used  in  this  algorithm.  It  is 
called  Inner  Product  Step  Processor.  Its  geometry  and  algo- 
rithmic definition  are  shown  in  Fig.  4.6  .  This  same  type  of 
cell  will  be  presented  later  in  another  algorithm 
implementation. 

In  this  simulaticn  problem,  different  colors  are  attrib- 
uted to  different  rows  of  matrix  A  and  to  different  columns 
of  matrix  B.  As  we  know,  each  element  of  the  product  matrix 
C  will  be  generated  by  a  combination  of  one  row  of  A  and  one 
column  of  3.  When  these  elements  join  each  other  in  a 
systolic  cell,  we  will  be  able  to  identify  exactly  that 
combination  taking  place  at  each  cell  in  the  space-time 
frame.  We  present  here  the  coding  of  colors  that  was 
adopted: 


blue   blue   blue 
red    red   red 
green  green  green 


blue  red  green 
blue  red  green 
blue  red  green 


blue     magenta  cyan 
magenta  red     yellow 
cyan    yellow   green 
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inp(l) 


out (2) 


out (3) 


i  np  (3) 


t 


j^  out(l)  =  inp(l)  J  -feedthrough 

inp(2> 

out (2)     ■    inp(2)     ;     feedthrough 

r    =    inp(l)     *    inp(2)     +    inp(3) 

out (3)  »  r  ;  processed  output 


out ( 1 ) 


Figure  4.6    Inner  Product  Step  Processor  Cell 

To  show  how  colors  can  help  in  understanding  the  mecha- 
nism of  irultiplicaticn  in  the  systolic  array,  we  will  track 
the  generation  of  the  element  c(3,2).  fie  see  from  the  above 
color  coding  that  c  (2,2)  is  a  yellow  element,  since  it 
results  from  the  ccirbination  of  a  green  row  and  a  red 
column.    Prom  mathematics, 

c(3,2)    =   a(3, 1)  .b  (1,2)    ♦    a  (3,2)  .  b  (2,2)     ♦    a  (3,  3)  .  b  (3,  2) 

Figure  4.7  shows  a  schematic  diagram  that  corresponds  to 
clock  cycle  number  1,  the  same  cycle  is  also  shown  in 
Fig.  4.8,  in  which  elements  a  (1,1)  and  b(1,1)  have  entered 
cells  nuibers  14  and  15  respectively.  From  Fig.  4.7  it  can 
also  be  seen  that  the  element  a  (3,1)  (green  element)  will 
enter  into  the  array  (at  cell  06)  at  clock  cycle  3  (see 
Fig.  4.7),  while  element  b(1,2)  (red  element)  will  enter  at 
clock  cycle  2  (at  cell  12)  (see  Fig.  4.9).  After  that,  they 
will  run  through  the  array  and  will  meet  each  other  at  cell 
02  at  clock  cycle  5  (see  Figure  4.12).  This  meeting  gener- 
ates  the   first   partial   result   a  (3, 1 )  .  b  (1 ,2)  .   This   can 
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Figure   4.7        Systolic  Array    for   Matrix-Matrix  Multiplication 
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Figure  4.8   Hatrix-Hatrix  Multiplication  after  Clock  Cycle  1 
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Figure  4.9   Ha trix- Matrix  Multiplication  after  Clock  Cycle  2 
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Pigure  4.10   Hatrix-Hatrix  Hultiplication  after  Clock  Cycle  3 
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Pigure   4,11        Matrix-Hatrix    Multiplication   after   Clock   Cycle   4 
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Figure  4,12   Matrix-Matrix  Multiplication  after  Clock  Cycle  5 
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Figure  4.13   Matrix-Matrix  Multiplication  after  Clock  Cycle  6 
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Figure  4.14   Matrix-Batrix  Multiplication  after  Clock  Cycle  7 
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Figure  4.15   Hatrix-Batrix  Multiplication  after  Clock  Cycle  8 
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Figure  4.16   Hatrix-Matrix  Multiplication  after  Clock  Cycle  9 
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be  verified  to  be  8x1=8  by  our  numeric  example  and  by  the 
contents  of  cell  02  at  clock  cycle  5  .  This  partial  result 
with  yellov  color  (identified  as  c(3,2)  for  convenience)  is 
pumped  upwards  to  cell  09.  Performing  similar  verification, 
it  can  he  seen  that  elements  a  (3,2),  b(2,2)  and  that  partial 
result  of  c(3,2)  will  meet  at  cell  09  at  clock  cycle  6  (see 
rig.  4.13).  They  will,  then,  generate  a  second  partial 
result 

a(3,1).b(1,2)    +  a(3,2).b(2,2) 

which  is  8x1+2x7=22  as  seen  in  Fig.  4.13  at  cell  09.  By 
similar  reasoning,  the  final  result  8x1+2x7+3x10=52  is 
generated  in  cell  14  at  clock  cycle  7  (see  Fig.  4.14). 

It  can  be  seen  that  matrix  C  is  symmetrical  with  respect 
to  the  coicr  coding.  Symmetrical  elements  like  C  (3,2)  and 
c(2,3)  are  computed  in  parallel  and  pumped  up  side  by  side. 
So,  they  are  easily  tracked  with  the  help  of  colors,  and  the 
whole  operation  can  be  better  visualized. 

C.   MATRIX  TRIANGDLABIZATION  BI  GIVENS  ROTATIONS 

We  will  provide  a  brief  review  of  this  problem.  The 
nomenclature  to  be  used  is  the  same  as  that  in  Chapter  III. 
Ihe  reader  is  also  referred  to  Appendix  C,  "A  Numerical 
Example  for  Matrix  Triangularization",  which  uses  the  same 
data  as  that  described  here. 

The  linear  system  that  needs  to  be  solved  is 

AX  =  B 

where  A  is  a  nxp  matrix,  X  is  a  column  vector  of  p  elements, 
and  E,  a  column  vector  of  n  elements. 

We  will  premultiply  the  matrix  A  by  a  transformation 
matrix  Q  such  that  the  product  matrix  becomes  an  upper 
triangular  matrix  R.  To  keep  the  matrix  equality  it  is 
necessary  to  premultiply  vector  B   by  matrix  Q.    This  will 
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result  in   the  product  vector   QB.   This  operation   is  shown 
below: 

QAX  =  QB 

and  as  E  =  QA,  it  beccmes 

EX  =  QB 

The  Givens  Eotations  in  this  algorithm  under  simulation 
triangularize  the  matrix  A  and  computes  the  vector  QB 
concurrently.  This  is  possible  because 

Q(A|B)  =  (QA)  |  (QB)  =  E|  (QB) 

where  the  operator  "  |"   performs  matrix  concatenation.    For 
clarityr  if  we  define 


A  = 


2  4   1 
5   7   4 

3  0   1 


and 


B  = 


12 
3 

8 


then  A|B  becomes 


AjB  = 


2  4   1  12 
5   7   4   3 

3  0   18 


As  we  have  shown  in  Chapter  III,  the  simulation  uses  the 
cell  structure  seen  in  Fig.  3.4  and  the  test  data  seen  in 
Figure  3.5  (the  same  as  that  used  in  the  numerical  example 
cf  Appendix  C)  .  The  picture  that  appears  on  the  screen  at 
the  start  of  the  simulation  is  shown  in  Fig.  4. 17  which  the 
reader  should  compare  with  Fig.  3.4  in  Chapter  III. 
Fig.  4.17  shows  the  identification  of  each  cell  at  its  lower 
right  corner.  Cells  numbered  13  to  10  represent  the  data 
wavefront  to  be  pumped  into  the  systolic  array.  They  are  not 
part  cf  the  array,  but  only  buffer  cells  to  allow  presenta- 
tion of  the  data  to  be  pumped   into  the  array  on  the  screen. 
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The  final  elements  of  vector  QE  are  calculated  in  cells  04, 
02,  and  01-  The  final  elements  of  matrix  A  are  calculated  in 
cells  numbered  09,  08,  06,  07,  05  and  03. 

As  shown  in   Appendix  C,   the  values   for  AX  =  B   are  as 
follows  and  A|B  is  as  shown  above. 


2 

4 

1    ■ 

■   x(1)     " 

'    12    " 

5 

7 

4 

• 

x(2) 

= 

3 

3 

0 

1 

.   x(3)    . 

8 

The  test  data  are  pumped  into  the  array  as  wavefronts 
that  can  be  seen  in  cells  13  to  10  just  before  being  entered 
(the  reader  can  refer  to  Fig.  3.5  to  the  several  wavefronts, 
each  one  corresponding  to  a  different  color) .  There  is  a 
time  shift  among  the  elements  of  the  same  row  for  the  matrix 
under  triangularization  (compound  matrix  A|B).  This 
displacement  is  needed  to  establish  the  correct  timing  for 
the  systolic  operation.  In  this  simulation,  we  decided  to 
assign  a  different  color  for  each  row  of  A|3  (as  shown  in 
Fig.  3.5)  and  to  keep  track  of  the  flow  of  the  colored 
elements  through  the  array.  This  flow  can  help  us  to  under- 
stand, in  a  sequence  of  space-time  frames,  how  the  whole 
array  behaves.  The  effect  of  the  different  clock  for  subseq- 
uent cycles  can  be  seen  from  Fig.  4.18  to  4.26  .  In 
Fig.  4.18,  element  a  (1,1)  of  matrix  A  enters  into  cell  09. 
"We  can  see,  in  cell  13  element  a  (2,1),  and  in  cell  12 
element  a (1,2).  They  are  ready  to  be  clocked  in  at  the  next 
clock  cycle.  This  happens  as  seen  in  Fig.  4.19,  when  cells 
09  and  C8  display  the  stored  result  of  the  computation 
performed.  Notice  that  cells  09,  07,  and  03  actuate  like  a 
reflector  for  the  data  wavefront  that  is  going  downwards. 
Ihey  rotate  the  data  flow  direction  by  90  degrees  counter- 
clockwise. As  a  result  of  the  interaction  between  the  wave- 
fronts  that  go  towards  the  right  and  that  going  downwards, 
there  is  a  resulting  wavefront   that  flows  towards  the  lower 
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right.  This  is  the  important  observation  because  it  shows 
how  the  effect  of  the  input  data  spreads  over  the  array 
(this  is  why  we  decided  to  associate  this  wavefront  with 
identical  colors) .  The  wavefront  flowing  towards  the  right 
carries  information  about  the  angle  that  the  row  elements 
must  re  rotated.  This  information  must  arrive  at  each  cell 
just  in  phase  with  information  of  the  matrix  element  heing 
rotated,  which  is  being  transferred  to  each  cell  by  the 
downward  wavefront.  At  each  clock  cycle  a  new  rotation  angle 
is  calculated  at  the  inclined  boundary  cells  (Givens 
External  Cells,  namely  09,  07  and  03)  and  transmitted  tc  the 
rightest  cells  at  the  same  row  (that  is  from  cell  09  to 
cells  C8,  06  and  04  in  the  upper  row,  from  cell  07  to  cells 
05  and  02  in  the  seccnd  row  and  from  cell  03  to  cell  01  in 
the  lower  row) .  Partial  results  are  .  generated  at  each 
systolic  array  element  at  each  clock  cycle.  At  the  moment 
when  the  downward  wavefront  start  feeding  zeros  into  a 
systolic  array  element,  it  freezes  its  content  and  do  not 
modify  it  any  more.  We  have  selected  the  black  color  to 
indicate  that  the  input  data  are  bringing  no  inf ornta ticn. 
The  black  wavefront  flowing  through  the  systolic  array  acts 
as  a  freezing  wavefrcnt.  Figs.  4.19  up  to  4.26  show  the 
effect  of  the  remaining  clock  cycles.  In  Fig.  4.26  we  have 
the  final  result  of  the  triangulariza tion  frozen  into  the 
cells.  The  upper  triangular  matrix  R  and  the  vector  QB  have 
teen  computed.  It  can  be  checked  out  against  those  values 
shown  in  Appendix  "A  Numerical  Example  for  Matrix 
Triangularization" .  The  computed  results  are  ready  to  be 
used  in  the  Back  Substitution  to  solve  the  system  equations. 
In  order  to  transfer  the  data  from  the  cells  to  the  hardware 
that  performs  the  Back  Substitution  operation,  a  special 
connection  which  is  net  shown  here  becomes  necessary.  But, 
it  is  not  required  here  for  the  understanding  of  the  Givens 
Rotations  process. 
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Figure  4.17 


Matrix  Triangularization  Array 
Initialization 
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Figure  4.18   Hatrix  Triangularization  Array 
after  Clock  Cycle  1 
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Figure  4.19   Matrix  Triangular ization  Array 
after  Clock  Cycle  2 
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Figure  4.20   Matrix  Triangularization  Array 
after  Clock  Cycle  3 
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Figure  4.21    Hatrix  Triangularization  Array 
after  Clock  Cycle  4 
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Figure   4.22        Matrix  Triangularization   Array 
after  Clock   Cycle   5 


61 


Figure  4.23   Matrix  Triangular ization  Array 
after  Clock  Cycle  6 
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Figure  4.24   Matrix  Triangularization  Array 
after  Clock  Cycle  7 
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Figure  4.25    Hatrix  Triangularization  Array 
alter  Clock  Cycle  8 
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Tigure  4.26   Hatrix  Triangular ization  Array 
after  Clock  Cycle  9 
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We  will  now  present  a  different  kind  of  perspective  to 
the  reader  on  how  tie  algorithm  performs.  The  geometrical 
interpretation  of  Givens  Rotations  is  extremely  helpful  to 
provide  tetter  insight.  A  matrix  can  be  considered  as  repre- 
senting a  set  of  vectors  (as  many  as  the  number  of  columns) 
in  a  n-dimensional  space  (n  equals  the  number  of  rows)  .  If 
the  matrix  has  only  three  rows,  their  columns  are  vectors  of 
a  three  dimensional  sjace,  and  the  elements  of  the  first  row 
are  the  components  ever  the  x-axis  of  the  column  vectors. 
The  rotation  operation  does  not  rotate  the  vectors.  The 
reference  frame  (coordinate  axes)  is  the  one  that  is 
rotated,  and,  as  a  conseguence,  the  description  of  the 
vectors  in  terms  of  their  components  with  respect  to  the  new 
reference  becomes  different.  When  a  matrix  is  rotated  to 
become  an  upper  triangular,  the  first  column  is  transformed 
in  such  a  way  that  or.ly  element  (1,1)  will  be  nonzero.  This 
means  that  the  first  column  vector  is  positioned  over  the 
x-axis  ir  the  new  reference  frame.  As  the  vector  is  the  same 
as  before,  the  numerical  value  of  the  new  element  (1,1)  must 
be  egual  to  the  length  of  the  vector.  Element  (1,2)  of  the 
rotated  matrix,  for  example,  is  the  x-component  of  the 
second  column  vector  in  the  new  reference  frame.  Since  the 
relative  position  of  both  vectors  is  unaltered,  the  new 
value  of  that  element  must  be  egual  to  the  projection  of  the 
second  column  vector  over  the  first  column  vector,  since 
this  last  one  is  now  along  the  new  x-axis.  The  geometrical 
interpretation  can  be  extended  to  all  elements  of  the  array. 

"We  will  track  tie  build  up  of  the  numerical  values  of 
cells  09  (that  stores  element  (1,1))  and  08  (that  stores 
element  (1,2)).  This  allows  us  to  study  the  operation  of 
both  tjpes  of  cells  used  in  this  algorithm  (cell  09  is  an 
outer  cell  and  cell  08  is  an  inner  cell) .  It  will  also 
provide  a  comparison  between  the  geometric  interpretation 
presented  above  and  the  algebraic  values  shown  in  graphics. 


66 


.1 


If  t   =  O   trten 
besi  a 
c   =    1.0 
s    =«  0.0 
end 
else  begin 


temp   =*     Vr*   +  zl 
C    =   r    /   temp 
5   ^  c    /    temp 
r  =  temp 


end 


Legend:   r   =   residue 


Figure  4-27    Givens  External  Processor  Cell 

"We  will  start  with  cell  09.  From  Fig.  4.27  we  see  that 
this  cell  receives  the  elements  of  the  "nonrotated"  first 
column  vector,  computes  the  rotation  angle  (calculating  its 
outputs  cosine  c  and  its  sine  s)  ,  and  stores  the  "rotated" 
x-component  r  of  the  first  vector.  The  other  components  are 
obviously  zeros  and  correspond  to  the  other  elements  of  the 
first  column  (cells  cf  the  lower  array  that  are  not  shown 
because  their  contents  are  always  zeros) .  The  rotation 
angle  information  is  passed  as  output  to  cell  0  8  to  be  used 
to  "rotate"  the  second  vector.  Table  I  shows  how  the  param- 
eters r,  c,  and  s  of  this  ceil  respond  to  the  external 
inputs.  These  values  can  be  verified  with  the  help  of  the 
cell  algorithm  presented  in  Fig.  4.27  .  The  description  cf 
the  operation  can  te  followed  through  the  sequence  of 
Figs.  4.28  to  4.30. 
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TABLE   I 

...,___    7 

Time   Descr. 

iption  of  Outer   Cell  Operation 

Cycle 

Input   z 

Output   c          Output    s 

Residue   r 

0 
1 
z 

4 

0.0 
2.0 
5.0 
3.0    { 
0.0 

[black) 

blue) 
red) 
green) 
|  black) 

1.000                   0.000 
0.000                   1.000 
0.371                   0.928 
0.874                   0.487 
1.000                   0.000 

0.000 
2.000 
5.385 
6.  164 
6.  164 

. 

• 

At  clock  cycle  0  no  input  has  been  received  and  so  there 
is  nothing  to  rotate.  The  rotation  angle  is  zero  (c=1.0  and 
s=0.0).  At  cycle  1,  the  first  element  is  received.  The  first 
element  is  the  old  x-component  of  the  first  column  vector. 
As  it  is  over  the  x-axis,  the  reference  does  not  need  to  be 
rotated  now.  However,  when  this  occurs,  because  of  the  way 
the  algorithm  is  implemented  {we  will  see  the  reason  when 
analysing  cell  08)  ,  an  angle.  90  degrees  is  informed  to  cell 
03  (c=0.0  and  s=1.0)  although  the  value  r  stored  in  cell  09 
is  2.0,  equal  to  the  input.  At  clock  cycle  2,  the 
y-component  of  the  old  first  column  vector  is  entered.  Now  a 
rotation  is  necessary  to  keep  the  x-axis  of  the  reference 
frame  over  the  first  column  vector.  The  rotation  angle  is 
computed  (c=0.371  and  s=0.928)  and  the  reference  fraice  is 
rotated  (about  the  z-axis) .  The  new  x-component  (stored  in 
r)  is  the  square  root  of  the  summation  of  the  squares  of  the 
old  x  and  y-componen ts,  that  is  5.385  (corresponding  tc  the 
projection  of  the  first  column  vector  over  the  x-y  plane)  . 
At  cycle  3,  the  z-coraponent  is  entered.  As  the  last  rotation 
resulted  in  a  vector  over  the  x-axis,  the  combination  of 
this  with  the  just  entered  z-component  will  result  in  a 
vector  in  the  x-z  plane  deviated  from  the  x-axis  because  of 
the   newly   arrived   z-ccmponent.    Again   a   rotation    is    necessary 
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Figure  4.28    Beference  Fraae  before  Rotations 

to  keep  the  x-axis  over  the  first  column  vector.  The  new 
x-component  is  6. 16  4.  At  clock  cycle  4  a  black  datum  is 
received  (a  zero  value)  and  that  freezes  the  contents  of 
cell  09  with  its  final  value.  No  more  rotations  are  required 
and  the  output  of  the  cell  informs  rotation  angle  zero 
(c=1.0  and  s=0.0)  . 

New  we  will  follow  the  build  up  of  the  contents  of  cell 
08  (the  right  side  neighbor  cell  of  cell  09) .  Cell  08  stores 
element  (1,2),  the  x-component  of  the  second  column  vector. 
As  done  before,  we  present  table  II  with  inputs  and  values 
stored  in  this  cell  at  each  clock  cycle.  In  this  case  we 
will  disregard  the  numerical  output  of  this  cell  because  we 
will  concentrate  on  the  build  up  of  the  residue  r  in  this 
cell.  From  Fig.  4.31,  this  cell  receives  as  input  the  rota- 
tion angle  (inputs  c  and  s) ,  by  which  the  reference  frame 
has  changed,  to  be  used  to  compute  the  new  x-component  of 
the  second  column  vector  (to  be  stored  at  r) .  The  ether 
input   refers  to   the  elements   of   the  "nonrotated"   second 
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Figure  4-29   Reference  Frame  after  First  Rotation 


Figure  4-30   Reference  Frame  after  Second  Rotation 
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Figure  4.31    Givens  Internal  Processor  Cell 

column  vector  (input  z)  .  The  outputs  of  this  cell  are  the 
rotation  angle  (c  and  s  are  passed  as  output  to  the  right 
neighbor  cell  05)  and  rotation  information  to  the  neighbor 
cell  immediately  below  cell  08  to  "rotate"  the  other  compo- 
nents of  the  column  vector  as  a  result  of  the  reference 
frame  rotation. 

The  operation  of  this  cell  is  more  difficult  tc  visu- 
alize. Figure  4.32  will  be  used  for  the  explanation.  It  only 
shows  the  x-y  plane.  Suppose  the  second  column  vector  is  A, 
as  seen  in  that  Figure.  The  eld  reference  frame  is  x1-y1. 
The  components  of  A  with  respect  to  that  frame  are  ax  and 
ay.  In  our  numerical  example,  ax=4.0  is  the  first  input 
z(i)  to  the  cell.  Tie  second  input  z (i)  to  the  cell,  on  the 
following  cycle,  is  ay=7.0.  Suppose  the  reference  frame  is 
rotated  about  the  z-axis  to  position  x2-y2  to  keep  the 
x-axis  along  the  first  column  vector,    which  is  shown  as  3. 
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TABLE  II 
Ti«e  Description  of  Inner  Cell  Operation 


Cycle   Input  c 


Input  s 


Input  z   Residue  r 


0 
1 
2 
3 
ZJ 
5 


1.000 
1.000 
C.000 
0.371 
0.874 
1.000 


black) 
black) 
blue) 
red) 

green) 
lack) 


0.00  0 
0.000 
1.000 
0.928 
0.487 
0.000 


black) 

black) 

blue) 

red) 

green) 

black) 


0.  0  (black) 
0.  0  (black) 
4.0  (blue) 
7.0  (red) 
0.  0  (green) 
0.0  (black) 


0.000 
0.000 
4.000 
7.985 
6.976 
6.976 


Figure   4.32        Gecnetric   Interpretation   of  a    Rotation 

Instead  of  evaluating  the  components  of  A  in  this  new  frame, 
let  us  study  the  effect  over  its  components  ax  and  ay.  In 
this  new  frame,  ax  will  be  decomposed  into  components  axx 
(over  the  new  x-axis)  and  axy  (over  the  new  y-axis)  . 
Similarly,      ay    will    be  decomposed      into   components   ayx     (over 


72 


the  new  x-axis)  and  ayy  (over  the  new  y-axis) .  The  component 
of  A  over  the  new  x-axis  is  the  summation  of  axx  and  ayx. 
This  will  he  the  new  element  (1,2).  As  seen  in  Fig-  4.32, 

axx  =  ax  .  cos  (alf  a)   and  ayx  =  ay  .  sin  (alf  a) 
and  as  element  (1,2)  is 

r  =  axx  +  ayx 
it  results 

r  =  ax  .  ccs(alfa)  +  ay.  sin  (alf  a) 

As  ax  was  the  previous  r  and  ay  is  the  newly  arrived  input 
z  (i) ,    in   computer   algorithmic    language    we    have 

r   =   c(i)    *   r   ♦   s(i)     *   z  (i) 

as  in  Fig.  4.31  .  Substituting  the  values  of  clock  cycle  2 
into  this  equation,  we  find 

r  =  0. 0  *  0.0  +  1.0  *  4.0 

r  =  4.0 

and  this  can  be  checked  against  the  above  table.  At  clock 
cycle   3,    the   values    give   us 

r   =    0. 371    *    4.0    ♦    0.923    *    7.0 

r   =   7. 985 

that  also  checks  against  the  above  table.  Doing  the  same  for 
clock  cycle  4,  we  get  r=6-976.  This  gives  the  final  value  of 
the  x-ccmponent  of  the  second  column  vector.  The  rctaticn, 
however,  will  affect  all  other  components.  Let  us  describe 
how    this   occurs.    The   component    of   A    over    the    new    y-axis   is 

z  (o)    =   ayy   -   axy 

z  (o)    =    ay    .    cos  (alf  a)    -   ax    .    sin  (alf  a) 
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and  in  algorithmic  language, 

z  (o)  =  c(i)  *  z  (i)  -  s  (i)  *  r 

This  information  z(c)  is  passed  to  the  cell  immediately 
below  tc  generate  the  new  y-component  of  the  second  column 
vector. 

A  further  point  to  be  noticed  is  that  if  the  rotation 
angle  input  to  an  inner  cell  is  zero  (c=1.0  and  s=0.0),  the 
cell  will  not  change  the  stored  value  at  r.  This  is  why  the 
rotation  angle  is  informed  as  90  degrees,  as  mentioned 
previously,  when  the  first  element  is  received  at  cell  09. 
This  triggers  cell  C8  to  receive  and  store  the  z  injrut  at 
the  following  clock  cycle.  At  the  end  of  clock  cycle  02,  the 
z=4.0  input  is  stored.  No  rotation  is  performed,  so  r=4.0. 
The  following  clock  cycles  impose  the  modification  of  the 
contents  of  this  cell  because  of  the  changes  in  the  refer- 
ence frame.  As  soon  as  the  input  z  freezes  (at  clock  cycle 
5) ,  the  residue  r  of  the  cell  also  freezes.  It  can  be 
observed,  if  the  outputs  of  cell  09  and  its  inputs  tc  cell 
08  are  compared,  the  transmmission  delay  of  one  clock  cycle 
from  a  cell  tc  another.  It  can  also  be  noticed  that  the 
inputs  to  cell  08  always  have  the  same  color.  This  simula- 
tion was  purposely  designed  this  way  to  show  the  need  for 
sinchronization  between  the  wavefronts  that  propagate 
through  the  array. 

Certainly  we  cannct  present  all  this  complexity  with  the 
simulator.  The  algorithm  of  Givens  Rotations  is  the  most 
complex  that  we  have  traced  in  our  survey.  However,  the 
simulator  has  been  a  fundamental  tool  to  perform  a  numerical 
study  and  to  verify  the  correctness  of  the  algorithm. 
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D.   BACK  SUBSTITUTION 

He  will  use  the  results  of  the  previous  section  as  input 
data  for  the  simulation  described  in  this  section.  This  way, 
with  these  two  sections,  we  can  go  through  the  complete 
problem  described  in  the  Appendix  "A  Numerical  Example  for 
Matrix  Triangulariza tion".  That  is,  to  search  for  the  vector 
X  in  matrix  equation  5X=B.  The  process  of  Back  Substitution 
is  described  in  [Bef.  7:  page  24 ]. 

One  of  the  cells  used  in  this  algorithm  is  shown  in 
Figure  4.33  .  The  above  reference  presents  the  cell  shewn 
in  Pig.  4.33  as  a  circular  shaped  cell  named  Back 
Substitution  Cell.  Since  SYSGRAS  would  take  too  much  tiire  to 
draw  circular  shapes,  an  octogonal  shape  is  used  instead 
(see  cell  number  05  in  Figure  4.37). 

The  other  type  cf  cell  used  in  this  algorithm  is  the 
same  Inner  Product  Step  Processor  that  was  presented  in 
Fig.  4.6  to  do  Matrix-Matrix  Multiplication.  This  is  an 
interesting  aspect  of  systolic  arrays:  many,  algorithms  that 
appear  in  the  literature  are  implemented  with  the  same  types 
of  cells.  However,  the  geometry  of  the  array,  in  this  algo- 
rithm, requires  the  cell  to  be  square  shaped.  Although  the 
cell  is  functionally  the  same  as  shown  in  Fig.  4.6,  we 
present  it  again  in  Figure  4.34  to  match  the  way  Fig.  4.3C 
represents  it  as  part  of  the  structure  (see  cells  04  and 
02)  . 

The  mathematical  background  for  solving  a  triangular 
linear  system  involving  a  lower  triangular  array  has  been 
presented  by  Kung  in  [Ref.  1:  page  19].  We  will  adopt  here 
his  explanation. 

Suppose  the  systen  equation  to  be  solved  is 

RX  =  D 
where  R  =  (r(i,j))  is  a  nonsingular   nxn   lower   triangular 
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Figure  4-33    Back  Substitution  Main  Cell 
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Figure  4.34    Inner  Product  Step  Processor  Cell 

matrix  and  D  is  a  a  n-vector,   both  being  given-   To  coirpute 
the  vector  X,  we  can  use  Forward  Substitution  as  follows: 

k  :=  1 

y(i,1)  :=  0 
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repeat 

y(i,k  +  1)  :=  y(i,k)  +  r(i,k)  -  x(k) 

k  :=  k  +  1 
until  k  =  i 
x(i)  :=  (d(i)  -  y  (i,i))  /  r  (i,i) 

The  above  algorithm  can  be  used  to  calculate  the 
elements  cf  vector  X  in  the  seguence  x(1),  x(2),  ...  and  so 
on.  This  algorithm  can  be  implemented  in  a  systolic  array  as 
will  be  shown.  Y  =  (y(i,k))  is  a  vector  of  partial  results 
that  allcws  the  recurrence  to  build  up. 

In  our  case,  since  the  interest  is  in  Back  Substitution, 
vie  will  simply  enter  the  data  into  the  systolic  array  in  an 
crder  reverse  to  that  proposed  by  Kung  in  his  paper.  Let  us 
make  this  point  more  clear.  We  do  Forward  Substitution  when 
we  have  a  lower  triangular  array.  In  this  case  we  solve 
first  fcr  x{1),  next  for  x (2)  and  so  on.  Following  the 
seguence  proposed  by  Kung,  vector  X  should  be  pumped  into 
the  systolic  array  as  x(1)  ,  x(2)  and  x  (3)  .  The  same  for 
vector  QB.  Since  here  we  are  doing  3ack  Substitution,  we 
enter  the  vector  X  in  a  reverse  seguence  x  (3)  ,  x(2),  and 
x(1)  .  QE  is  also  entered  the  same  way.  The  way  matrix  P. 
being  pumped  into  the  systolic  array  is  modified  with 
respected  to  that  prcjosed  by  Kung.  For  instance,  the  first 
of  its  elements  to  be  pumped  into  the  systolic  array  is 
r(3,3)  (see  cell  10  in  Fig.  4.36)  which  is  required  to 
compute  x(3).  In  the  Forward  Substitution  Method,  the  first 
element  pumped  should  be  r(1,1)  to  compute  x(1)  .  The  ether 
elements  of  R  are  rearranged  accordingly. 

Ihe  system  equation  to  be  solved  in  this  simulation  is 

EX  =  QB 

where  B  is  an  upper  triangular  matrix  and  QB  is  a  n-vector 
that  resulted  from  the  transformation  seen  in  the  previous 
Secticn. 
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Figure  4.35  presents  the  arrangement  of  the  data  with 
respect  to  the  systolic  array.  The  data  elements  are  drawn 
in  such  a  way  that  the  required  data  synchronization  to 
perform  the  operation  is  evident.  Vector  QB  and  matrix  R 
actually  carry  data  into  the  systolic  array  for  processing. 
Vector  Y  enters  the  array  bringing  no  data.  Its  elements  are 
all  zeros  at  the  moment  represented  in  Fig.  4.35  .  Its 
values  are  modified  as  it  flows  through  the  array.  Vector  X 
is  generated  into  cell  05,  the  Back  Substitution  Cell,  and 
its  elements  are  pumped  through  the  array  in  a  direction 
opposite  to  that  of  vector  Y.  As  they  go  through  cells  04 
and  02,  they  combine  with  elements  from  matrix  R  to  build  up 
the  elements  of  vector  Y.  Finally,  X  will  come  out  with  its 
final  value  when  it  leaves  the  array,  from  cell  number  02  of 
Pig.  4.35. 

Figures  4.35  and  4.36  should  be  compared.  This  last  one 
presents  the  same  arrangement  as  the  former,  but  it  shows 
how  the  picture  appears  on  the  screen.  Matrix  E  is  on  the 
upper  block  of  the  cells  (cells  number  07  to  21)  (Refer  to 
Appendix  C  to  compare  the  numerical  values) .  This  block  dees 
not  belong  to  the  array  itself  and  is  used  only  for  presen- 
tation of  the  data.  The  systolic  array  is  represented  by 
cells  C5,  04,  and  02.  The  elements  of  vector  QB  are  intro- 
duced from  its  left  side  (at  cell  06).  The  vector  of 
partial  results  Y,  a  string  of  zeros  separated  by  one  clcck 
cycle  delay,  is  pumped  in  from  the  right  side  (at  cell  03)  . 
The  output  that  will  collect  the  solved  elements  of  vector  X 
is  represented  by  cell  01.  Attention  to  the  fact  that  the 
numbers  that  are  displayed  at  cells  04  and  02  are  the  values 
assumed  by  the  elements  of  vector  Y  as  they  flow  leftwards 
through  the  array.  Ihe  number  displayed  in  cell  05  is  the 
element  of  vector  X  being  computed  and  that  in  cell  01  is 
the  element  of  X  being  pumped  out  to  the  external  world. 
Another   point   to   notice   is   the   way   the   bidirectional 
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connections  between  cells  05  and  04,  and  04  and  02  is  imple- 
mented by  SYSGRAS.  Only  one  bidirectional  arrow  is  used. 

fle  decided  to  associate  the  elements  of  each  column  of 
the  matrix  R  (third  column  in  blue,  second  in  red,  and  first 
in  green)  and  the  corresponding  row  elements  of  vectors  X 
and  QB  with  the  same  color-  (e.g.  the  element  number  3  of 
the  QE  will  interact  only  with  the  column  number  3  of  the 
matrix  R) .  Cell  processor  number  5,  the  so  called  Back 
Substitution  Cell  is  shaped  differently,  as  mentioned 
earlier,  to  emphasize  its  special  purpose  in  the  systolic 
array.  This  cell  processor  computes  the  elements  of  vector  X 
from  the  received  data  and  pumps  them  out  to  cell  04  to  be 
utilized  for  calculation  of  the  previously  referred  partial 
results.  flhen  the  X  elements  complete  their  trip  through 
the  array,  they  are  pumped  out  at  cell  01. 

The  manner  in  which  color  is  used  to  provide  information 
is  now  described.  The  elements  of  vectors  QB,  X  and  Y,  as 
well  as  matrix  R,  have  primary  colors  (red,  blue  and  green), 
fthen  they  meet  elements  of  different  primary  colors,  the 
cell  where  this  meeting  takes  place  assumes  a  secondary 
color  that  is  the  result  of  that  combination.  The  elements 
of  vector  Y,  for  exanple,  can  be  seen  with  their  original 
color  in  cell  03,  before  mixing  up  with  others.  During  their 
trip  through  the  array  they  keep  that  color,  but  as  a  result 
of  their  combination  iiith  different  color  elements,  the  cell 
where  they  might  be  may  present  different  secondary  colors. 
However,  we  should  keep  in  mind  that  only  primary  colors 
travel  from  a  cell  to  another.  Secondary  colors  are  static. 

To  make  these  points  more  clear,  we  will  track  the 
formation  of  elements  x(3)  and  x  (2)  .  Since  R  is  an  upper 
triangular  matrix,  we  have 

x(3)  =  gb(3)  /  r(3,3) 
Substituting  numerical  values,  as  shown  in  Appendix  C, 

x  (3)  =  -10.593  /  0.842  =  -12.571 
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Figure  4-36   Bad  Substitution  Array  Initialization 
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As  it  has  teen  pointed  out,  the  Back  Substitution  Cell,  cell 
number  05  in  Fig.  4,37,  is  responsible  for  computing  the 
elements  of  X.  That  Figure  presents  the  computation  of 
element  x(3).  Element  gb(3)  was  pumped  in  from  cell  06  (see 
Fig.  4.36)  and  element  r(3,3)  from  cell  10  (see  same 
Figure).  It  can  also  be  seen  in  Fig.  4.37  that  x(3)  is  blue 
since  it  was  formed  ty  the  combination  of  gb  (3)  and  r(3,3), 
loth  tlue  (remember  that  the  third  column  of  E  has  been 
coded  blue).  After  its  computation,  element  x(3)  is  pumped 
through  the  array  to  cell  04  (clock  cycle  2,  Fig.  4.38),  to 
cell  02  (clock  cycle  3,  Fig.  4.39)  and  finally  to  the 
external  world  (clock  cycle  4,  Fig.  4.40)  at  cell  01.  During 
this  trip,  its  value  is  used  for  computation  of  the  elements 
of  vector  Y  which  are  necessary  for  computation  of  the  other 
elements  of  vector  X.  Now  let  us  track  the  formation  of 
x(2)  . 

From  algebra,  we  have 

x(2)  =  (gb(2)  -  x(3).r(2,3>)  /  r(2,2) 

The  partial  result  x(3).r(2,3)  is  computed  at  clock  cycle  2, 
in  cell  04.  This  cell  receives  x (3)  from  ceil  05  (competed 
at  clcck  cycle  1,  see  Fig.  4.37)  and  r(2,3)  from  cell  08 
(see  Fig.  4.37).  This  partial  result  is  called  y (2)  (color 
coded  red  because  it  will  contribute  to  the  formaticn  of 
x(2),  which  color  code  is  red)  .  This  time,  y  (2)  ,  being  orig- 
inally red,  appears  in  cell  04  as  magenta  because  the 
contribution  of  y(2)  is  activated  blue  when  it  interacts 
with  tlue  r  (2, 3)  and  tlue  x(3).  At  this  point  of  the  compu- 
tation all  necessary  data  to  compute  x (2)  is  available. 
Element  y  (2)  is  stored  in  cell  04,  element  gb  (2)  is  ready  to 
enter  the  array  (see  cell  06,  in  Fig.  4.38),  and  so  is 
element  r  (2,2)  (see  cell  10  of  Fig.  4.38).  At  clock  cycle  3 
these  elements  are  pumped  into  the  Back  Substitution  Cell 
(cell  05)  (see  Fig.  4.39)  and  the  computation  of  x  (2)   takes 
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place.        Substituting   numerical     values       (as     seen   in      these 
cells)    into  the   above   equation   it  becomes 

x(2)     =     (-0.566    +    11.538)     /    4.042    =    2.714 

Element  x{2)  assumes  color  red  because  all  factors  that 
contributed  to  its  formation  had  that  color  code.  After  its 
computation,  x(2)  is  pumped  through  the  array  via  cell  04 
(clock  cycle  4,  Fig.  4.40),  cell  02  (clock  cycle  5, 
Fig.  £1.41)  and  finally  pumped  out  to  cell  01.  During  this 
trip,  similarly  for  x(3),  it  contributes  to  the  formation  of 
y(1),  another  element  of  the  vector  of  partial  results.  This 
will  te  required  to  the  computation  of  x(1). 

The  sequence  of  pictures  from  Fig.  4.36  to  4.44  displays 
the  whole  computational  process  in  the  systolic  array 
according  to  the  algorithm. 

E.   FBBTHEB  EXPLOR ATICHS 

The  potential  of  SYSGEAS  goes  far  beyond  the  implementa- 
tion of  systolic  algorithms.  Presently,  because  of  pratical 
importance,  extensive  research  is  being  carried  out  on  the 
investigation  of  faults  in  the  actual  systolic  devices 
[Hef.  11].  How  to  circumvent  those  faults  and  which  effect 
they  do  have  on  the  results  are  some  of  the  subjects  under 
study.  SYSGEAS  can  be  used  as  a  valuable  tool  in  performing 
such  studies,  because  of  the  simplicity  of  its  interface 
with  the  user  and  because  of  the  possibility  of  inclusion  of 
possible  subroutines  to  support  the  algorithms.  To  emphasize 
its  versatility,  we  address  the  fact  that  a  processor  cell 
may  have  a  large  numfcer  of  input/output  ports  (although  it 
is  presently  set  up  for  a  maximum  of  four) .  This  allows  a 
large  number  of  connections  with  other  cells  or  the  outside. 
This  characteristic  car  be  used,  as  example,  to  inject  data 
into  a  cell  that  is   situated  in  any   position  in  the  array. 
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Figure  1- 37   Back  Substitution  Array  after  Clock  Cycle  1 
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Figure  U.38   Back  Substitution  Array  after  Clock  Cycle  2 
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Figure  1.39   Back  Substitution  Array  after  Clock  Cycle  3 
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Figure  1.40   Back  Substitution  Array  after  Clock  Cycle  4 
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Figure  1.41    Back  Substitution  Array  after  Clock  Cycle  5 


38 


Figure  4.42   Back  Substitution  Array  after  Clock  Cycle  6 
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Figure  4.43   Back  Substitution  Array  after  Clock  Cycle  7 
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Figure  4.44   Back  Substitution  Array  after  Clock  Cycle  8 
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These  data  could  consist  of  logical  commands  that  would 
alter  the  function  cf  the  processor  for  the  selected  cell. 
These  commands,  in  a  particular  application,  could  "kill" 
the  cell,  or  degrade  its  performance  according  to  a  desired 
need.  Another  approach,  that  would  reduce  the  data  entry 
volume,  would  be  to  design  a  subroutine  to  implement  a 
defective  processor  and  assign  that  processor  to  the  target 
cell.  Ihe  color  feature  can  also  be  used  to  analyse  the 
effect  that  faults  at  a  particular  cell  will  have  over  the 
ethers.  If  a  color  is  injected  at  selected  input  ports 
(depending  on  the  software  of  the  subroutine  that  supports  a 
processor) ,  it  will  spread  over  the  cells  that  receive  data 
from  this  faulty  cell  under  evaluation,  and  will  display  the 
"bad  sector"  on  the  screen. 

Many  other  possible  studies  related  to  systolic  arrays 
may  te  considered  due  to  the  flexibility  of  the  SYSGEAS 
software. 
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7.    THE    INI EB ACTION    WITH    THE   SIMULATOR 

A.       SCFT11RE   REQUIREHFBTS   AND    PROCEDURES 

Id   order  to   operate   with   SYSGRAS,    the    user   must   have   the 
following   files. 

SYSTOY.PAS 
SYSFCE.FOB 
L0GIN.COM 

JUBK.CCM 


The  user,  after  compiling  SYSTOY.PAS  and  SYSFOR.FCR, 
must  link  both  relocatable  codes  with  SIGGRAPH  core,  which 
is  at  CS3202.CORE  library.  To  do  this,  the  command  to  be 
used  is 

LINK  SYSTOY,SYSFOR, (CS3202. CORE)  CORE/LIB 

In  order  to  run,  enter  the  command  "RUN  SYSTCY".  The  code 
will  address  the  RAM1F.K  peripheral  system  when  running,  and 
this  device  should  be  turned  on  and  ready  to  operate. 

B.   ENTERING  A  NEW  PECBLEM 

When  the  command  "RUN  SYSTCY"  is  entered,  the  simulator 
starts  prompting  all  questions  that  must  be  answered  by  the 
user  to  enter  the  problem.  In  Appendix  D  we  have  recorded  a 
session  of  the  Simulator  to  make  the  interaction  more  under- 
standable. It  was  cur  original  intention  to  present  this 
recording  of  the  session  as  a  guide  to  SYSGRAS  for  the 
Matrix  Triangulariza tion  problem. However ,  due  to  the  large 
amount  of  material  that  needs  to  be  presented,  we  decided  to 
include  a  short   "dumay"  session  in  Appendix   D  for  clarity. 
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It  shows  hew  to  enter  the  data  and  how  to  handle  an  incor- 
rect input.  The  simulator  permits  recovery  from  entry  errors 
and  permits  display  of  input  data  as  many  times  as  desired. 
The  user  has  the  following  choices  when  dealing  with  the 
simulator: 

select  a  new  problem  run  or  a  previous  problem  session 

record   the  session  in  a   separate   file   for   later 

printing 

ability  to  interrupt  the   graphics  presentation  at  the 

end  of  each  clock  cycle 

It  is  recommended  that  the  user  follow  exactly  the  same 
nomenclature  that  was  mentioned  in  Chapter  III  Section  E  to 
avoid  errors  that  night  be  difficult  to  troubleshoot . 

C.   REVIEW  OF  A  PREVIOUS  SESSION 

We  have  included  in  Appendix  E  a  recording  of  a  session 
with  SYSGRAS  for  the  purpose  of  reviewing  a  previous 
session.  The  user  roust  take  care  to  rename  the  file  which 
contains  the  data  from  previous  session  into  a  file  under 
the  name  SYSARRAY. MEM  .  This  file  must  be  the  most  recent 
version  generated  by  the  VAX  VMS  Operating  System. 

The  following  files  with  data  from  previous  sessions  are 
already  available,  and  may  be  run  upon  request  freir  the 
user  : 

file  DIVDIFF. MEK,   with  recording   of  simulation  anal- 
ysed in  Chapter  IV,  Section  A. 

file  MDLTPIY.MEM,   with  recording   of  simulation  anal- 
ysed in  Chapter  IV,  Section  3. 

file  GIVENSW.MEK,   with  recording   of  simulation  anal- 
ysed in  Chapter  IV,  Section  C. 
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-   file  BAKSUBS. MEH,   with  recording   of  simulation  anal- 
ysed in  Chapter  IV,  Section  D. 


D.   RECORDING  OF  COHPDTED  DATA  AND  PRINTING  ODT 

If  the  user  has  asked  for  this  facility,  SYSGEAS  will 
write  all  computed  results  of  each  cell  at  each  clock  cycle 
into  a  file  named  SYSPRINT.DOC,  which  can  be  printed  out 
after  the  end  of  the  session. 
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71.  TEE  SIMULATOR  MAINTENANCE 

A.  STBOCTUHE  OF  THE  SIMULATOR 

The  SYSGRAS  has  teen  designed  in  two  layers.  The  upper 
layer  that  interacts  with  the  user  is  in  PASCAL,  and  so  it 
is  guite  portable.  In  fact,  due  to  the  use  of  the 
"OTHERWISE"  feature  offered  by  the  "CASE"  command  in  the  VAX 
PASCAL  COMPILER,  a  few  of  its  case  structures  have  to  be 
slightly  modified  before  it  can  be  installed  in  another 
machine.  The  lower  layer  is  constructed  in  FORTRAN  77  and 
it  is  compiled  with  VAX  FORTRAN.  This  layer  interfaces  with 
the  Graphics  Package  SIGGRAPH  that  is  presently  available  at 
NPS  on  the  VAX  750  machine.  It  has  been  set  up  for  presenta- 
tion en  the  RAMTEK  9400  system,  and  so  it  is  not  portable. 
Its  structure,  however,  can  be  modified  to  interface  with 
another  graphics  package  with  characteristics  similar  to 
those  of  SIGGRAPH. 

The  PASCAL  layer  is  presented  in  Appendix  A  and  the 
FORTRAN  layer  is  in  Appendix  B. 

B.  MODULAR  DESIGB  ASPECTS 

The  design  of  SYSGRAS  has  followed  the  philosophy  of 
modular  design.  The  data  structure  has  been  designed  to 
match  the  abstract  idea  of  a  systolic  array  and  its  nultijle 
features,  subsequently  it  can  be  easily  identifiable  by  the 
program  user.  The  basic  element  is  the  record  "NODE".  This 
element  keeps  all  the  information  regarding  each  cell.  The 
whole  array  is  a  collection  of  NODES's,  and  these  are  assem- 
bled into  a  higher  level  hierarchy  by  the  record  "GRAPH" 
which  contains  all  information  about  the  whole  array  at  each 
clock  cycle. 
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The  part  of  the  program  that  is  most  relevant  tc  the 
user  ref€rs  to  the  cell  processor  support  routines.  If  new 
types  of  cells  need  tc  be  simulated,  new  support  subroutines 
must  he  added  to  the  existing  set.  This  can  be  done  without 
complete  knowledge  of  the  program  implementation  because  of 
the  modularity  of  the  design.  Now  we  will  refer  specifi- 
cally to  this  kind  of  software  maintenance. 

C-   DESIGNING   AND    IHSTALLING   A   NEH    PROCESSOR   SOEPORT 
SUBROUTINE 

The  cell  processor  support  subroutines  modify  the  global 
data  structure  repres€nted  by  the  variable  G  of  type  GRAPH. 
The  existing  set  of  such  subroutines  are  under  the  comment 
titled  "library  routines  for  cell  processors"  in  Appendix  A- 
If  a  new  one  needs  to  be  created  and  added,  it  should 
conform  with  the  pattern  seen  in  examples.  The  subroutine 
implementation  must  te  placed  between  the  statement  "with 
G.NODES(.I.)  do  begin"  and  the  statement 
"COLCE_£IOCESSING (G,I) ".  This  last  command  calls  a  subrou- 
tine that  will  compute  the  color  of  the  cell  as  a  result  of 
the  combination  of  the  different  primary  colors  of  the  input 
data. 

The  following  steps  must,  therefore,  be  followed  to 
introduce  a  new  cell  subroutine: 

i.  increase  the  constant  N0MBER_OF_ROUTINES . 

ii.  modify  the  enumerated  type  PROCESSOR_TYPE  to  include 
another  routine  reference  name  (e.g.  F0UTINE_9)  . 

iii.  modify  procedure  DEFINE_NODES  to  include  references 
to  the  new  subroutine.  This  should  be  done  in  such  a 
way  that  the  new  routine  is  referred  the  same  way  as 
the  others. 

iv.  modify  procedure  OUTPUT_ARRAY  the  same  way  as  above. 
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v.  modify  procedure  CORRECT_NODE  as  above. 

vi.  modify  the   main  program  statement  "case   PROCESSOR 
cf"  as  above. 

vii.  place  the  body  of  the  new  procedure  next  to  the 
existing  procedures  to  conform  with  the  program 
structure. 

Hints  for  the  design  of  the  new  procedure: 

the  type  NODE  defines  the  cell  data  structure. 

-  the  number  of  input/output  ports  in  a  cell  car  be 
modified  by  changing  the  constant  CONNECTIONS  in  the 
main  program.  If  this  is  done,  the  constant  MAXIIKKS 
also  must  be  modified  as  instructed  in  the  program 
text  comments. 

The  present  implementation  of  SYSGRAS  restricts  the 
maximun  number  of  cells  in  an  array  to  23,  in  order  to 
achieve  a  faster  computational  speed.  If  desired,  that  can 
be  modified  by  altering  the  main  program  constant  MAXNCDES. 
If  this  is  done,  the  constant  MAXLINKS  must  also  be  modified 
as  instructed  in  the  program  text  comments. 
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VII.  CONCLUSIONS 

A.   ESTABLISHING  COHPABISON  PABAMETERS 

The  simulations  that  have  been  studied  in  Chapter  V 
provided  us  a  reasonable  tool  to  use  for  evaluation  of  algo- 
rithms implemented  in  systolic  arrays.  Important  factors 
that  irust  he  considered  in  this  kind  of  an  evaluation  are: 

i.  use  of   the  same  type  of  processors  in   other  algo- 
rithms 

ii.  average  percentage   of  cells   involved  in   effective 
computation  per  clock  cycle 

iii.  degree  of  homogeneity  of  cells 

iv.  degree  of  the  usage  of  pipelining 

v.  possibility  of  modular  expandability 

vi.  required  number  of  external  connections  to  each  cell 

vii.  cell  complexity 

Unfortunately  not  all  these  factors  can  be  determined  by 
using  this  tool.  Sub jectiveness  also  has  to  play  a  role  in 
this  evaluation. 

E.   CONCLUSIONS  ABOUT  ALGOBITHMS  UNDER  ANALYSIS 

The  inplementaticn  of  an  algorithm  in  a  systolic  array 
may  provide  a  greater  calculation  speed,  but  a  cost  was  paid 
to  get  that  achievement. The  cost  can  be  evaluated  in  terms 
of  the  ccmplexity  of  the  design,  the  difficulty  in  inplemen- 
tation,  the  manufacture  problems  that  can  result  frcm  a 
particular  array  configuration,   etc.    Ke  want  to  make  sure 
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that  this  cost  is  not  excessive  and,  ideally,  to  be  minimum. 
These  are  considerations  that  require  criteria  of  evaluation 
to  be  established.  Instead  of  trying  to  come  up  with  a 
criteria  that  could  he  the  subject  of  lengthy  discussion, 
the  above  mentioned  factors  are  used  to  discuss  the  effi- 
ciency of  the  algorithms  with  a  more  qualitative  than  quan- 
titative approach. 

The  first  factor  we  want  to  establish  is  that  concerning 
the  average  percentage  of  cells  effectively  involved  in 
active  computation  per  clock  cycle  in  each  algorithm.  For 
each  algorithm  we  consider  the  number  of  required  cycles  to 
run  till  completion,  the  number  of  physical  cells  in  the 
systolic  array,  and  the  number  of  cells  involved  in  active 
computation  in  each  clock  cycle.  This  last  factor  corre- 
sponds to  the  number  of  cells  shown  on  the  screen  with  a 
color  other  than  black.  Thus  we  do  have: 

Matrix  Triangularization : 
number  of  cycles  =  9 
total  number  of  cells  =  9 
percentage  of  computation  /  cycle  = 

=  1/9  x  1/9  x  (1+2>4+5+6+5+3+1+0)  = 

=  0.333 

Performing  similar  calculations  for  the  other  algo- 
rithms, we  obtain  the  numbers  shown  up  in  Table  I. 
Eefinitely,  the  Matrix  Triangularization  has  a  longer  "duty 
cycle"  per  cell  and  this  means  less  waste. 

The  number  of  types  of  cells  required  for  each  algorithm 
is  easily  seen  in  the  pictures  shown  in  Chapter  IV.  Modular 
expandability  is  another  important  feature.  It  implies  the 
possibility  of  interconnecting  chips  (each  one  embedding  a 
whole  systolic  array)  in  a  cascade  fashion  or  seme  ether 
arrangement  where  the  chips  are  used  as  smaller  parts  of  a 
higher  hierarchical   arrangement.   Matrix   Triangularization 
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and  Back  Substitution  have  restrictions  in  that  respect  due 
to  the  fact  that  their  cell  arrangement  is  not  symmetrical 
and  such  an  expansion  would  require  different  types  of 
chips. 

T?ith  respect  to  these  factors,    we  can  summarize  in  the 
following  table; 


TABLE  III 

Comparison  of  Algorithm  Ieplementation  Evaluation 

Factors 


FEATURE 

MATRIX 
TRIANG. 

0.333 

2 

BACK 
SUBST. 

0.250 

2 

R 

DIV. 
DIFF. 

MATRIX 
MUIT. 

Average 
percentage  of 
duty  cycle 
computing 
per  cell 

Nuiber  of 
cell  types 

0.250 

1     i 

Y 

0.153 

1 

Modular 
expandability 

R 

Y 

R  =  restricted 
Y  =  yes 


The  number  of  external  connections  required  by  each  cell 
is  pehaps  the  most  demanding  factor  existing  in  pratical 
implementations.  The  normal  connections  are  power  and  clock 
lines.  Some  algorithms  like  Matrix  Multiplication  need  only 
these  two.  Some  others  have  synchronization  and  data  feeding 
problems  that  can  only  be  solved  with  the  addition  of  extra 
external  connections  to  the  cell.  This  feature  is  somehow 
related  to   the  degree  of   pipelining  that  can   be  achieved. 


101 


The  Matrix  Triangularization  and  the  Divided  Differences 
algorithms  can  be  pratically  implemented  if  there  is  a  flag 
signal  that  can  trigger  a  pumping  out  mechanism  to  send  the 
data  frozen  in  each  cell  after  the  last  clock  cycle.  To 
recover  the  data  and  display  it  externally,  additional 
connections  are  needed  in  each  cell.  This  increases  the 
total  ccst  of  the  chip  and  complicates  the  problem  of 
modular  expandability.  Since  a  design  goal  in  every  VLSI 
implementation  is  to  reduce  the  number  of  external  connec- 
tions, this  is  a  key  factor  in  the  design  effort. 
Pipelining  possibility  is  also  affected  because  the  pumping 
out  operation  is  not  cart  of  the  normal  algorithm  cycle.  It 
is  an  additional  burden  that  may  require  the  interruption  of 
the  data  pumping  into  and  the  pipelining  will  have  to  be 
restricted  to  data  of  the  same  problem.  Data  from  different 
problems  can  not  coexist  at  the  array  at  the  same  time.  A 
calculation  of  a  prctlem  has  to  finish  in  order  to  start 
another.  In  this  sense,  if  we  examine  the  Matrix 
Multiplication  algorithm,  we  will  conclude  that  true  pipe- 
lining can  be  achieved.  No  data  flow  interruption  occurs 
because  the  results  are  not  kept  frozen  in  the  cells,  hut 
are  moved  as  part  of  the  data  flow. 

..The  factor  of  cell  complexity  is  important  not  only 
because  of  the  possible  high  cost  of  the  hardware,  but  also 
because  of  the  time  that  may  be  required  for  a  clock  cycle 
to  be  executed.  If  a  longer  time  is  required,  the  clock 
speed  has  to  be  kept  low  and  the  efficiency  of  the 
processing  decreases.  Simple  operations  are  always  a  goal  on 
the  algorithm  design  because  they  result  simpler  and  faster 
hardware.  This  is  why  another  algorithm  for  Matrix 
Triangularization  without  Square  Roots  has  been  suggested. 
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C.  SUGGESTIONS  FOB  FUTURE  MODIFICATIONS  ON  SYSGRAS 

We  are  conscious  that  we  have  not  achieved  an  optimal 
design  with  our  simulator  in  terms  of  performance.  It  could 
be  improved  if  the  graphics  implementation  were  more  effi- 
cient. This  would  speed  up  the  interaction  with  the  user  and 
make  it  more  attractive.  Another  point  that  cculd  be 
improved  is  the  user  interface.  The  amount  of  information 
that  is  required  from  the  user  requires  an  effort  that  night 
be  minimized  if  other  interface  technigues  were  used,  such 
as  combined  use  of  mouse  or  lightpen  with  the  keyboard 
input. 

Additional  points  that  could  be  modified  to  enhance  the 
presentation  at  the  screen  are: 

-  Elimination  of  non  significant  zeros  from  the  cell 
display  (such  as  turning  000.400  into  0.4)  . 
Change  of  the  bidirectional  arrow  that  represents 
ccmmunications  in  both  directions  between  cells  into 
unidirected  arrows,  one  for  each  link  between  cell 
ports.  This  would  make  the  cells  appear  on  the  screen 
the  same  way  they  are  represented  in  the  literature. 
An  example  of  the  birectional  arrow  can  be  seen  in 
Fig.  4.36,  in  Chapter  IV,  connecting  cells  05  and  04, 
as  well  as  04  ard  02. 

D.  TEE  BOLE  OF  THE  SIMULATOR 

Our  goal  was  to  contribute  to  the  understanding  cf  the 
implementation  of  algorithms  in  systolic  arrays.  This  led  us 
to  the  realization  of  some  of  the  inherent  problems  that 
appear  in  this  field.  The  complexity  of  the  data  interac- 
tion in  the  space-time  frame  is  considered  to  be  the 
greatest  obstacle  in  the  understanding  process.  To  decide 
on  the  approach  to  adopt  to  handle  the  problem  is  also 
difficult.   In  response   to  that,   we  designed   a  tool  whcse 
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task  is  to  provide  a  software  implementation  of  the  systolic 
array.  The  user  is  left  free  to  concentrate  on  the  algo- 
rithm. He  have  shown  the  use  of  this  tool  on  the  study  of 
some  algorithms.  This  gave  us  the  opportunity  to  realize  the 
power  of  a  systolic  array  in  the  process  of  performing 
calculations.  Aspects  such  as  computation  time,  memory 
requirements  and  I/O  interactions  are  minimized.  The  mapping 
of  an  algorithm  onto  a  systolic  array  certainly  represents  a 
major  difficulty.  The  main  role  of  our  simulator  is  net  to 
help  in  this  mapping,  but  it  can  be  used  in  the  validation 
phase  of  the  algorithm  design.  The  greatest  contribution 
that  we  see  in  SYSGEAS  is  its  versatility.  The  user  can 
model  an  ideal  environment  for  the  algorithm  or  an  environ- 
ment contaminated  by  problems  in  the  underlying  hardware. 
This  certainly  can  telp  in  achieving  better  results  while 
presenting  a  clear  idea  of  the  robustness  of  an  algorithm. 
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